Data Verification in ETL Processes
Abstract
ETL processes extract data from external sources, transform it to satisfy integration needs, and load it into the data warehouse. In data mining, meanwhile, there is particular interest in metrics that support efficient classification algorithms. One such approach uses metrics on partitions based on the Shannon entropy (or other forms of entropy) to study the degree of concentration of values. In this paper we show how this idea can be used to verify the consistency of data loaded into the data warehouse by ETL processes. We calculate the Shannon entropy and the Gini index on partitions induced by attribute sets, and we show that these values can be used to signal a possible problem in the data extraction process.
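As a rough illustration of the idea in the abstract, the sketch below computes the Shannon entropy and the Gini index of the partition that an attribute set induces on a batch of rows, then compares the result with a value from an earlier load. It assumes the standard definitions (entropy as the negative sum of p log2 p over the relative block sizes p, Gini index as one minus the sum of p squared). The function names, the row format (a list of dictionaries), and the `tolerance` threshold are hypothetical choices made for this example, not taken from the paper.

```python
from collections import Counter
from math import log2


def block_proportions(rows, attributes):
    """Relative block sizes of the partition induced by `attributes`.

    Two rows fall in the same block when they agree on every attribute.
    """
    counts = Counter(tuple(row[a] for a in attributes) for row in rows)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def shannon_entropy(rows, attributes):
    """Shannon entropy (in bits) of the induced partition."""
    return -sum(p * log2(p) for p in block_proportions(rows, attributes))


def gini_index(rows, attributes):
    """Gini index of the induced partition; 0.0 for an empty batch."""
    proportions = block_proportions(rows, attributes)
    if not proportions:
        return 0.0
    return 1.0 - sum(p * p for p in proportions)


def drifted(reference, current, tolerance=0.1):
    """Hypothetical check: flag a load whose metric moved more than `tolerance`."""
    return abs(reference - current) > tolerance


# Illustrative data only: a previous load and a suspicious new load.
previous_load = [
    {"country": "A", "channel": "web"},
    {"country": "A", "channel": "shop"},
    {"country": "B", "channel": "web"},
    {"country": "B", "channel": "shop"},
]
new_load = [{"country": "A", "channel": "web"} for _ in range(4)]

reference = shannon_entropy(previous_load, ["country"])
current = shannon_entropy(new_load, ["country"])
if drifted(reference, current):
    print("Possible extraction problem: entropy on 'country' changed")
```

A sharp drop in entropy, as in the illustrative `new_load`, means the values have become concentrated in fewer blocks than before, which may point to a source that was only partially extracted. A suitable threshold would have to be calibrated on real loads.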
Similar resources
METL: Managing and Integrating ETL Processes
Companies use Extract-Transform-Load (ETL) tools to save time and costs when developing and maintaining data migration tasks. ETL tools allow the definition of often complex processes to extract, transform, and load heterogeneous data into a data warehouse or to perform other data migration tasks. In larger organizations many ETL processes of different data integration and warehouse projects ac...
Managing ETL Processes
ETL tools allow the definition of sometimes complex processes to extract, transform, and load heterogeneous data into a data warehouse or to perform other data migration tasks. In larger organizations many ETL processes of different data integration projects are accumulated. Such processes can encompass common sub-processes, shared data sources and targets, and same or similar operations. Howev...
Requirements Analysis Method For Extracting-Transformation-Loading (ETL) In Data Warehouse Systems
The data warehouse (DW) system design involves several tasks such as defining the DW schemas and the ETL processes specifications, and these have been extensively studied and practiced for many years. The problems in heterogeneous data integration are still far from being resolved due to the complexity of ETL processes and the fundamental problems of data conflicts in information sharing enviro...
BPMN-Based Conceptual Modeling of ETL Processes
Business Intelligence (BI) solutions require the design and implementation of complex processes (denoted ETL) that extract, transform, and load data from the sources to a common repository. New applications, like for example, real-time data warehousing, require agile and flexible tools that allow BI users to take timely decisions based on extremely up-to-date data. This calls for new ETL tools ...
A UML Based Approach for Modeling ETL Processes in Data Warehouses
Data warehouses (DWs) are complex computer systems whose main goal is to facilitate the decision making process of knowledge workers. ETL (Extraction-Transformation-Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into DWs. ETL processes are a key componen...
Publication date: 2007